---
title: Assess data quality during EDA
dataset_name: N/A
description: How DataRobot performs Exploratory Data Analysis (EDA) and how to assess the quality of your data at each stage of EDA.
domain: platform
expiration_date: 10-10-2024
owner: izzy@datarobot.com
url: docs.datarobot.com/docs/tutorials/prep-learning-data/assess-data-quality-eda.html

---

# Assess data quality during EDA {: #assess-data-quality-during-eda }

In this tutorial, you'll learn how DataRobot performs Exploratory Data Analysis (EDA) and how to assess the quality of your data at each stage of EDA&mdash;*EDA1* and *EDA2*.

Preparing your data is an iterative process. Even if you clean and prep your training data prior to uploading it to DataRobot, you can still improve its quality by assessing features during EDA.

## Takeaways {: #takeaways }

This tutorial explains:

* Exploratory Data Analysis, including EDA1 and EDA2
* How to add your data to DataRobot
* How to use the Data Quality Assessment tool
* How to evaluate feature importance

## Stages of EDA {: #stages-of-eda }

During EDA, DataRobot performs a Data Quality Assessment. The assessment reports the data quality issues that are relevant to your current stage of model building. Click one of the following tabs to learn about the two EDA stages.


=== "EDA1"
    <br>
    EDA1 (data ingest) occurs after you upload your data. EDA1 assesses the **All Features** list and detects issues like:<ul><li>[Outliers](quality-check#outliers)</li><li>[Inliers](quality-check#inliers)</li><li>[Excess zeros](quality-check#excess-zeros)</li><li>[Disguised missing values](quality-check#disguised-missing-values)</li><li>[Inconsistent gaps in time series projects](quality-check#irregular-time-steps)</li></ul>

=== "EDA2"
    <br>
    Once you click **Start** on the **Data** page, DataRobot performs another round of EDA. During this stage, DataRobot detects [target leakage](quality-check#target-leakage) and non-linear correlations between the features and the target, which helps you analyze [feature importance](#investigate-feature-importance). EDA2 reports on the selected feature list. If a feature list is not selected, EDA2 reports on the default **All Features** list.
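
To make two of the EDA1 checks concrete, here is a minimal Python sketch of what "excess zeros" and "disguised missing values" can look like in a numeric column. It is not DataRobot's implementation: the sentinel set, the `min_share` threshold, and the `days_since_last_visit` column are all assumptions chosen for illustration.

```python
from collections import Counter

# Assumption: values that commonly stand in for "missing" in exported data.
SENTINELS = {-1, -999, 999, 9999}


def zero_fraction(values):
    """Return the fraction of values that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


def disguised_missing_candidates(values, min_share=0.1):
    """Return sentinel values that make up at least min_share of the column."""
    counts = Counter(values)
    return sorted(
        v for v, c in counts.items()
        if v in SENTINELS and c / len(values) >= min_share
    )


# Hypothetical column, not taken from the tutorial dataset.
days_since_last_visit = [12, -999, 30, 0, -999, 45, 7, -999, 0, 21]

print(zero_fraction(days_since_last_visit))
print(disguised_missing_candidates(days_since_last_visit))
```

A heuristic like this only surfaces candidates. Whether a repeated `-999` really means "missing" depends on how the data was collected.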


## Load and view your dataset {: #load-and-view-your-dataset }

As soon as you load your dataset, DataRobot performs EDA1. In this phase, DataRobot generates summary statistics based on a sample of your data.

1. Import your dataset.

    ![](images/tu-eda-import-dataset.png)

    To do so, drag a local file to the **Begin a project** page, browse for a **Local file**, or import from an external data source or URL.

    DataRobot uploads the dataset, creates a new project, and performs an initial EDA. View the progress in the Worker Queue on the right.

    ![](images/tu-eda-worker-queue.png)

    ??? tip
        To learn how DataRobot handles larger datasets, see [Fast EDA](fast-eda#fast-eda-application).


2. Once you import your data, click **Explore the data** or scroll down to see the features in your dataset.

    DataRobot displays the features and provides summary information and statistics.

    ![](images/tu-eda-scroll-to-features.png)

    |   | Label | Description |
    |---|---|---|
    | ![](images/icon-1.png) | **Var Type** | The data type DataRobot identifies for the feature during EDA, for example, Numeric, Categorical, Boolean, Image, Text, and special feature types like Date. |
    | ![](images/icon-2.png) | **Unique** | The number of unique values for the feature. |
    | ![](images/icon-3.png) | **Missing** | The number of missing values for the feature. |
    | ![](images/icon-4.png) | **Mean, Std Dev, Median, Min, Max** | DataRobot calculates these statistics for numerical features. |

    The sample dataset featured in this tutorial contains patient data.

    ![](images/tu-data-dataset.png)

    The goal is to predict the likelihood of patient readmission to the hospital. The target feature is `readmitted`.
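
If you want to cross-check the summary columns outside the platform, the sketch below computes the same kinds of statistics for one numeric column using only the Python standard library. It is an illustration under stated assumptions, not DataRobot's calculation: the `num_lab_procedures` column and its values are hypothetical, missing values are represented as `None`, and missing values are excluded from the unique count.

```python
import statistics


def summarize(values):
    """Return Unique, Missing, and basic statistics for one numeric column."""
    # Assumption: None marks a missing value and is excluded from all statistics.
    present = [v for v in values if v is not None]
    return {
        "unique": len(set(present)),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }


# Hypothetical column, not taken from the tutorial dataset.
num_lab_procedures = [41, 59, None, 44, 51, 41, None, 70]

summary = summarize(num_lab_procedures)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Because DataRobot generates EDA1 statistics from a sample of your data, values computed on the full dataset this way may differ slightly from those shown on the **Data** page.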

## Assess data quality after EDA1 {: #assess-data-quality-after-eda1 }

EDA1 helps you catch data issues before you start modeling.

1. Above your feature list and to the right, click **View info**.

    The Data Quality Assessment dropdown menu displays.

    ![](images/tu-eda-DQA.png)

    ??? tip
        The Data Quality Assessment provides the following issue status flags:<ul><li>Warning ![](images/icon-warning.png): Attention or action required.</li><li>Informational ![](images/icon-info-dq.png): No action required.</li><li>No issue ![](images/icon-ok.png).</li></ul>

2. Optionally, click **Filter affected features by type of issue detected** and select particular issues to search for.

     ![](images/tu-eda-dqa-filter.png)

3. Scroll down to locate the features with issues.

     If a feature has an issue, the issue flag displays in the **Data Quality** column. Hover over the flag to view the type of issue.

     ![](images/tu-eda-select-outlier-feature.png)

4. Click a feature that displays an issue flag, then use tools such as the Histogram, Frequent Values, and Feature Associations to explore further.

     See [Learn more](#learn-more) for tutorials that show how to use these tools.
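
When a feature carries an outlier flag, it can help to reproduce the finding with a simple rule of your own. The sketch below uses Tukey's interquartile range (IQR) rule as a stand-in. It is not the check DataRobot runs, and the `time_in_hospital` values and the `k=1.5` multiplier are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical column, not taken from the tutorial dataset.
time_in_hospital = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]

print(iqr_outliers(time_in_hospital))
```

A value flagged this way is not necessarily an error. Use the Histogram to decide whether it reflects a data entry problem or a legitimate extreme case.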


## Assess data quality after EDA2 {: #assess-data-quality-after-eda2 }

EDA2 kicks off after you set your target and start the modeling process.

1. Under **What would you like to predict**, enter your target feature.

    ??? tip
        You can keep the mode set to the default, **Quick** autopilot, or you can select a different [modeling mode](model-data#set-the-modeling-mode). You can also customize your [modeling settings](model-data#customize-the-model-build).

2. Click **Start**.

    DataRobot performs a number of processing steps. Monitor the steps in the Worker Queue.

    ![](images/tu-eda-eda2-worker-queue.png)

    As soon as DataRobot finishes analyzing features, you can take a look at feature importance. DataRobot continues with blueprint generation.


## Investigate feature importance {: #investigate-feature-importance }

The importance bars show the degree to which a feature is correlated with the target. Importance is calculated using an algorithm that measures the information content of the variable. This calculation is done independently for each feature in the dataset.

Investigate feature importance to determine which features are most useful for building accurate models and which features you can remove from your training data.
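
To build intuition for scoring each feature independently against the target, the sketch below uses mutual information between a categorical feature and a binary target. Mutual information is a swapped-in, textbook measure of shared information. This tutorial does not specify the algorithm DataRobot uses, so treat this as an analogy only. The feature names and values are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, target):
    """Return the mutual information, in bits, between two categorical sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    f_counts = Counter(feature)
    t_counts = Counter(target)
    mi = 0.0
    for (f, t), c in joint.items():
        # p(f, t) * log2( p(f, t) / (p(f) * p(t)) )
        mi += (c / n) * math.log2(c * n / (f_counts[f] * t_counts[t]))
    return mi


# Hypothetical data: a binary target and two candidate features.
readmitted = [1, 1, 0, 0, 1, 1, 0, 0]
features = {
    "feature_a": ["x", "x", "y", "y", "x", "x", "y", "y"],  # tracks the target exactly
    "feature_b": ["a", "b", "a", "b", "a", "b", "a", "b"],  # unrelated to the target
}

# Each feature is scored on its own, without reference to the other features.
scores = {name: mutual_information(col, readmitted) for name, col in features.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f} bits")
```

In this toy example, `feature_a` scores highest and `feature_b` scores zero. A feature that tracks the target this closely in real data is worth reviewing for target leakage before you rely on it.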

1. In the **Data** tab, scroll down to the feature list.

2. Take a look at the **Importance** column.

    The green bars indicate how closely a feature is related to the target.

    ![](images/tu-eda-importance.png)

    You might want to remove features that are unrelated to the target.


## Learn more {: #learn-more }

**Related tutorials**

* [Analyze features using histograms](analyze-features-using-histograms)
* [Analyze frequent values](analyze-frequent-values)
* [Analyze feature associations](analyze-feature-associations)
* [Work with feature lists](work-with-feature-lists)

**Documentation**

* [Data Quality Assessment](data-quality)
* [Data upload overview](import-data/index)
